Please check out the following link to view HTML version of this page: http://htmlpreview.github.io/?https://github.com/kyloop/sb_capstone1/blob/master/Project%20I%20Milestone%20Report.html
E-commerce has boomed in recent years, and many brick-and-mortar retail corporations (e.g. Walmart, Target) have entered the e-commerce competition. The concept of this project is to use the dataset provided by Walmart to forecast weekly sales across 45 stores based on a set of features (e.g. store size, unemployment rate, Consumer Price Index (CPI)). The first part of the analysis reflects the past and current financial condition of either a single store or a particular department. This part also explores how each feature impacts the business outcome, which is crucial information for corporate operations. In the second part of the analysis, a predictive time-series model will be implemented to forecast future sales on a weekly basis. This step is indispensable for a company: an estimate of the future direction of the company's profit, or even its stock price, helps the management team prepare a business strategy for the next period.
The primary audience for this analysis is Walmart's management team and executives. The findings from this research can inform the design of their operating format and business plans. A second group of audience comprises professionals who follow Walmart's business condition; the findings offer them insight into current and future sales at the brick-and-mortar stores. A third group is anyone who wants to understand the sales patterns in a Walmart store or a particular department. The analysis also shows how seasonal factors change sales, which is valuable for professionals and merchants.
- Discover the sales distribution across stores and individual departments
- Discover the correlations between the features and sales
- Discover how each feature contributes differently to the predictive models
- Determine the best model(s) for forecasting weekly sales (continuous values)
- Examine how the models perform differently across states
import pandas as pd
import numpy as np
trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")
df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)
#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']
display(train.head(3))
Before we get into the analysis, we first deal with the missing values. As we observe below, the missing values are concentrated in five features: MarkDown1, MarkDown2, MarkDown3, MarkDown4 and MarkDown5; no missing values are found among the other features. For this reason, we dig deeper into those features to check whether the missing values are correlated enough to be imputed.
The following shows the proportion of missing values for every store. Based on Table 1, the proportion of missing values across these five features ranges from 63% to 90%. Given these high volumes, we dig further to check whether the values are missing at random. Table 2 shows that the data without missing values span 2011-11-11 to 2012-10-26, and Table 3 shows that the data with missing values span 2010-02-05 to 2011-11-04. From this we can observe that the large volume of missing values is NOT missing at random. Considering both the huge volume and the missing pattern, I propose dropping the five features (i.e. columns) instead of records (i.e. rows), because keeping as many records as possible is important for the time-series analysis.
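The per-store missing proportions described above can also be computed compactly with a pandas groupby; a minimal sketch on a tiny synthetic frame (the column names mirror the real dataset, but the values are made up):

```python
import pandas as pd
import numpy as np

# Tiny synthetic frame mimicking the merged training data (values are made up)
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "MarkDown1": [np.nan, np.nan, 5.0, np.nan, 1.0, 2.0],
    "MarkDown2": [np.nan, 3.0, 4.0, np.nan, np.nan, 6.0],
})

# Proportion of missing values per store for the markdown columns
miss = toy.groupby("Store")[["MarkDown1", "MarkDown2"]].apply(lambda g: g.isnull().mean())
print(miss.round(2))
```

On the real `train` frame, the same one-liner over the five markdown columns reproduces Table 1 without the explicit per-store loop.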
cntLst = []
print("Table 1")
print('''Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''')
print()
cnt=0
for store in train.Store.unique():
    for var in ["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"]:
        cntLst.append(train[var][train.Store == store].isnull().sum())
    if cnt <= 5:
        print("Store #%d" % (store), end="")
        print(" with %d instances has missing values" % (train[train.Store == store].shape[0]), end=" ")
        print(cntLst, end=" ")
        print("in percent", end=" ")
        print(np.round(np.array(cntLst) / train[train.Store == store].shape[0], 2) * 100, end="\n")
        cnt += 1
    elif cnt <= 6:
        print("...")
        cnt += 1
    else:
        pass
    cntLst = []
cntLst = []
print("Table 2")
print('''Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
cnt=0
for store in train.Store.unique():
    arr = sorted(pd.to_datetime(train.Date[train.MarkDown1.isnull() == False][train.Store == store].unique()))
    if cnt <= 5:
        print("Store %d %s - %s" % (store, min(arr), max(arr)))
        cnt += 1
    elif cnt <= 6:
        print("...")
        cnt += 1
    else:
        pass
cntLst = []
print("Table 3")
print('''Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
cnt=0
for store in train.Store.unique():
    arr = sorted(pd.to_datetime(train.Date[train.MarkDown1.isnull() == True][train.Store == store].unique()))
    if cnt <= 5:
        print("Store %d %s - %s" % (store, min(arr), max(arr)))
        cnt += 1
    elif cnt <= 6:
        print("...")
        cnt += 1
    else:
        pass
print("Table 4")
print('''Split the feature "Store_Type" into A,B,C and "IsHoliday" into 1,0 ''')
#display(pd.get_dummies(train[["IsHoliday_x"]]))
pd.concat((pd.get_dummies(train["Store_Type"]),pd.get_dummies(train["IsHoliday"])), axis=1).head(5)
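`pd.get_dummies` turns each category level into its own 0/1 indicator column, which is what Table 4 shows; a tiny self-contained sketch:

```python
import pandas as pd

# One indicator column per category level, in sorted order
s = pd.Series(["A", "B", "C", "A"], name="Store_Type")
d = pd.get_dummies(s)
print(d.columns.tolist())   # ['A', 'B', 'C']
print(int(d.loc[0, "A"]))   # first row is type A, so its 'A' indicator is 1
```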
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline
init_notebook_mode(connected=True)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
    aa = round(train.Weekly_Sales[train.Store == int(name)] / 1000, 2)
    salesLst = salesLst + [aa]
traces = []
for name, sales in zip(namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))
layout = go.Layout(
title = 'Range of Sales Values by each store (Figure 1)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Weekly Sales (in thousands)",
titlefont=dict(size=18),
zeroline=True, range=[-10,200], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
    aa = round(train.Temperature[train.Store == int(name)], 2)
    salesLst = salesLst + [aa]
traces = []
for name, sales in zip(namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))
layout = go.Layout(
title = 'Range of Temperature Values by each store (Figure 2)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Temperature(F)",
titlefont=dict(size=18),
zeroline=True, range=[-5,120], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
    aa = round(train.Fuel_Price[train.Store == int(name)], 2)
    salesLst = salesLst + [aa]
traces = []
for name, sales in zip(namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))
layout = go.Layout(
title = 'Range of Fuel Price Values by each store (Figure 3)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,
ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Fuel Price($)",
titlefont=dict(size=18),
zeroline=True,
range=[2,5],
showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
    aa = round(train.CPI[train.Store == int(name)], 2)
    salesLst = salesLst + [aa]
traces = []
for name, sales in zip(namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))
layout = go.Layout(
title = 'Range of CPI Values by each store (Figure 4)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="CPI",
titlefont=dict(size=18),
zeroline=True, range=[100,250], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
    aa = round(train.Unemployment[train.Store == int(name)], 2)
    salesLst = salesLst + [aa]
traces = []
for name, sales in zip(namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))
layout = go.Layout(
title = 'Range of Unemployment Rate by each store (Figure 5)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Unemployment Rate",
titlefont=dict(size=18),
zeroline=True, range=[0,20], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
Let's first take a close look at the trend of weekly sales over this period. The sales plot captures the trend from Feb-2011 to Oct-2012. The sales range approximately from 0 to 4 million. The sales trend for each store is flat over time except during the two holiday seasons, which run roughly from Nov 20 to Dec 31 in both years. This period covers two big holidays, Thanksgiving and Christmas, when retailers traditionally offer huge discounts to customers. During these high-selling periods, sales rise to 1.5-2 times their usual level, so we ought to pay extra attention to predictions for the holiday period.
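The 1.5-2x holiday uplift mentioned above can be estimated directly by comparing mean weekly sales inside and outside the holiday weeks; a hedged sketch on made-up numbers (the real code would group the `train` frame by its `IsHoliday` flag the same way):

```python
import pandas as pd

# Synthetic weekly totals (in $M); the flag marks Thanksgiving/Christmas weeks
weekly = pd.DataFrame({
    "Weekly_Sales": [1.0, 1.1, 0.9, 2.0, 1.9],
    "IsHoliday":    [False, False, False, True, True],
})

# Mean weekly sales for holiday vs ordinary weeks, and their ratio
means = weekly.groupby("IsHoliday")["Weekly_Sales"].mean()
uplift = means[True] / means[False]
print(round(uplift, 2))
```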
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline
init_notebook_mode(connected=True)
#Reload Data
import pandas as pd
import numpy as np
trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")
df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)
#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']
# Dropping Features
train.drop(columns=["MarkDown1","MarkDown2","MarkDown3","MarkDown4", "MarkDown5"], inplace = True)
#Gather Weekly Sales for each store
namesLst=list(map(str,train.Store.unique()))
dateLst = train.Date.unique()
aa=[]
salesLst=[]
timeLst=[]
storeSalesDict={}
for name in namesLst:
    #print(name)
    storeSalesDict[int(name)] = []
    for date in dateLst:
        aa = round(train.Weekly_Sales[train.Store == int(name)][train.Date == date].sum() / 1000000, 2)
        storeSalesDict[int(name)].append(aa)
traces = []
for i, sales in storeSalesDict.items():
    #print(i, sales)
    traces.append(go.Scatter(
        x=pd.to_datetime(dateLst),
        y=sales,
        name="Store#" + str(i)
    ))
import numpy as np
# Visibility mask for the store-selector buttons: row i shows only trace i
aa = [[i == j for i in range(len(storeSalesDict))] for j in range(len(storeSalesDict))]
updatemenus = list([
dict(type="buttons",
active=-1,
buttons=list([
dict(label = 'Store 1',
method = 'update',
args = [{'visible': aa[0]},
{'title': 'Store 1 Sales'}])]))])
layout = go.Layout(
title = 'Weekly Sales (Figure 6)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=True,
#width=700,
#height=1000,
yaxis =dict(title = "Weekly Sales (in Millions)",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Time Series (in Months)",
titlefont=dict(size=18),
exponentformat='e',
showexponent='all',
zeroline=True,
showgrid=False,
rangeselector=dict(visible=True,
buttons=list([
dict(count=1,
label="1m",
step="month",
stepmode ="backward"),
dict(count=3,
label="3m",
step="month",
stepmode ="backward"),
dict(count=6,
label="6m",
step="month",
stepmode ="backward"),
dict(count=12,
label="12m",
step="month",
stepmode ="backward"),
dict(step ="all")])),
rangeslider = dict(visible=True)
),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False,
updatemenus=updatemenus
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)
Figure 7 below shows the total sales generated by each store type (i.e. A, B, C). Obviously, sales on regular working days account for the major share of total sales, because holidays like Christmas and Thanksgiving make up only a small portion of the year. However, in Figure 8 we can observe that average weekly sales on holidays are higher than on working days for Type A and B stores, while sales barely vary for Type C stores. To determine their sales performance between holidays and non-holidays more precisely, I propose setting up a hypothesis test for further verification.
trace1 = []
salesA = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False].sum()
salesB = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==False].sum()
salesC = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==False].sum()
trace1 = go.Bar(
x = sorted(list(set(train.Store_Type))),
y = [salesA, salesB, salesC],
name = "Working day"
)
salesA = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==True].sum()
salesB = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==True].sum()
salesC = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==True].sum()
trace2=[]
trace2 = go.Bar(
x = sorted(list(set(train.Store_Type))),
y = [salesA, salesB, salesC],
name = "Holiday"
)
data = [trace1, trace2]
layout = go.Layout(
title = "Store Type Sale (%s to %s) - Total (Figure 7) "% (min(train.Date), max(train.Date)),
barmode='group',
showlegend=True,
xaxis = dict(title = "Store Type"),
yaxis = dict(title = " Total Sales")
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)
#salesA_len= len(list(train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False]))
salesA_len = len(list(set(train.Date[train.Store_Type=="A"][train.IsHoliday==False])))
salesA_tot = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False].sum()
salesA_avg = salesA_tot/ salesA_len
salesB_len= len(list(set(train.Date[train.Store_Type=="B"][train.IsHoliday==False])))
salesB_tot = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==False].sum()
salesB_avg = salesB_tot/ salesB_len
salesC_len= len(list(set(train.Date[train.Store_Type=="C"][train.IsHoliday==False])))
salesC_tot = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==False].sum()
salesC_avg = salesC_tot/ salesC_len
trace1 = go.Bar(
x = sorted(list(set(train.Store_Type))),
y = [salesA_avg, salesB_avg, salesC_avg],
name = "Working day - " + str(salesA_len) + " days"
)
salesA_len= len(list(set(train.Date[train.Store_Type=="A"][train.IsHoliday==True])))
salesA_tot = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==True].sum()
salesA_avg = salesA_tot/ salesA_len
salesB_len= len(list(set(train.Date[train.Store_Type=="B"][train.IsHoliday==True])))
salesB_tot = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==True].sum()
salesB_avg = salesB_tot/ salesB_len
salesC_len= len(list(set(train.Date[train.Store_Type=="C"][train.IsHoliday==True])))
salesC_tot = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==True].sum()
salesC_avg = salesC_tot/ salesC_len
trace2 = go.Bar(
x = sorted(list(set(train.Store_Type))),
y = [salesA_avg, salesB_avg, salesC_avg],
name = "Holiday - "+str(salesA_len) + " days"
)
data = [trace1, trace2]
layout = go.Layout(
title = "Store Type Sale (%s to %s) - Weekly (Figure 8) "% (min(train.Date), max(train.Date)),
barmode='group',
showlegend=True,
xaxis = dict(title = "Store Type"),
yaxis = dict(title = " Weekly Sales ")
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)
Figure 9 below shows the ranking of departments by average weekly sales. The highest average weekly sales, about $75,000, are generated by Dept 92; the lowest, about -$7.68, by Dept 47. This plot demonstrates the sales disparity between departments.
sales_dept = []
dept_lst=[]
lst=[]
for dept in list(set(train.Dept)):
    lst.append(tuple((dept, round(train.Weekly_Sales[train.Dept == dept].sum() / len(train.Weekly_Sales[train.Dept == dept]), 2))))
df = pd.DataFrame(lst).sort_values(by=[1], ascending=True)
dept_lst = list(df.iloc[:,0])
sales_dept = list(df.iloc[:,1])
trace3 = []
trace3 = go.Bar(
x = list(df.iloc[:,1]),
y = list(map(str,list(df.iloc[:,0]))),
orientation="h"
#name = "Working day"
)
data = [trace3]
layout = go.Layout(
#barmode='group',
width=700,
height=1200,
#showlegend=True
title = 'Average Weekly Sales Ranking (Figure 9)',
yaxis =dict(title = "Department Number",
#exponentformat='e',
#showexponent='all',
titlefont=dict(size=18),
tick0=5,
ticks="outside",
dtick=1,
tickwidth=2,
showgrid=False,
type='category'),
xaxis = dict(title="Weekly Sales($)",
titlefont=dict(size=18),
zeroline=True,
#range=[2,5],
showgrid=True)
)
annotations=[]
for i in range(len(dept_lst)):
    annotations.append(dict(x=sales_dept[i],
                            y=dept_lst[i],
                            text="%0.2f" % (sales_dept[i]),
                            font=dict(family='Arial',
                                      size=12,
                                      color='red'),
                            showarrow=True,
                            align="center",
                            ax=40,
                            ay=0,
                            arrowhead=0))
layout['annotations'] = annotations
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)
The following scatter plots roughly show the correlations between the predictor variables and the target variable.
Unfortunately, it is difficult to judge the strength of the correlation between these features and Weekly Sales, except for Store Size (larger stores appear to have higher weekly sales). Hypothesis tests will be performed afterward for further verification.
'''*************************No need to Run*****************'''
wsLst=[]
tempLst=[]
fpLst=[]
cpiLst=[]
sizeLst=[]
storeLst=[]
dateLst=[]
for date in list(set(train.Date)):
    print(date)
    for store in list(set(train.Store)):
        storeLst.append(store)
        dateLst.append(date)
        wsLst.append((train.Weekly_Sales[train.Store == store][train.Date == date]).sum())
        tempLst.append(list(set(train.Temperature[train.Store == store][train.Date == date]))[0])
        fpLst.append(list(set(train.Fuel_Price[train.Store == store][train.Date == date]))[0])
        cpiLst.append(list(set(train.CPI[train.Store == store][train.Date == date]))[0])
        sizeLst.append(list(set(train.Size[train.Store == store][train.Date == date]))[0])
pd.DataFrame(list(zip(storeLst,dateLst,wsLst, tempLst, fpLst, cpiLst, sizeLst)),
columns= ["store","date","Weekly_Sales", "Temperture", "Fuel_Price", "CPI", "Size"]).to_csv("Weekly_Sales_Corr2.csv")
'''****************Read the Weekly Sales Correlated Feature Table************'''
df = pd.read_csv("Weekly_Sales_Corr2.csv")
ws=list(df.Weekly_Sales)
temp=list(df.Temperture)
fp=list(df.Fuel_Price)
cpi=list(df.CPI)
size=list(df.Size)
#df.head(5,)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from numpy import arange,array,ones
from scipy import stats
import seaborn as sns
plt.figure(figsize=(10,10))
sns.regplot(x="Fuel_Price", y="Weekly_Sales", data=df)
plt.title("Fuel Price vs Weekly_Sales", fontsize=20)
plt.show()
plt.figure(figsize=(10,10))
sns.regplot(x="Temperture", y="Weekly_Sales", data=df)
plt.title("Temperature vs Weekly_Sales", fontsize=20)
plt.show()
plt.figure(figsize=(10,10))
sns.regplot(x="CPI", y="Weekly_Sales", data=df)
plt.title("CPI vs Weekly_Sales", fontsize=20)
plt.show()
plt.figure(figsize=(10,10))
sns.regplot(x="Size", y="Weekly_Sales", data=df)
plt.title("Store Size vs Weekly_Sales", fontsize=20)
plt.show()
Following the hint from the correlation plots above, Store Size (independent variable) correlates the most with store Weekly_Sales (dependent variable). To verify this, the following heatmap shows the correlation values in numeric form. The values reflect the same result: Store Size is the only feature with a correlation value higher than 0.5, while the rest of the features show very little correlation with the dependent variable Weekly_Sales. I therefore suggest comparing the prediction results of models built with and without the least correlated features. Hypothesis tests will be performed to verify these results.
As for the correlations among the independent variables, we can also observe that none of them is highly correlated with another, so I would not remove any of the features at this point. As above, hypothesis tests will be performed for verification.
import matplotlib.pyplot as plt
import seaborn as sns
df2 = df.copy()
df2.drop(columns=["Unnamed: 0","store","date"], inplace=True)
df2_corr =df2.corr()
sns.heatmap(df2_corr, linewidths=0.5, annot=True)
plt.xticks(rotation=45)
plt.title("Correlation Heatmap", fontsize=15)
plt.show()
According to the following hypothesis tests, the features Temperature, CPI and Store Size each yield either Z < -3 or Z > 3 (e.g. Z = -5 or Z = 65), meaning the observed correlation lies more than 3 standard deviations from the mean of the permutation null. In other words, there is more than a 99.7% chance that the sample correlation did not arise by chance. We therefore have strong evidence to reject the null hypotheses and conclude that all of the features except Fuel_Price correlate with store weekly sales. This also aligns with the correlation values in the heatmap above, e.g. the correlation of Fuel_Price vs Weekly_Sales is 0.0095. As mentioned, I would not remove any features during the initial machine-learning phase even though some features have low correlation with Weekly_Sales (the dependent variable); instead, we can compare the performance of models built with and without those least correlated features.
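Since the same permutation test is repeated for every feature pair below, it can be wrapped in one reusable helper; a sketch under my own naming (`perm_corr_z` is hypothetical, and a fixed seed plus a modest trial count keep it fast and reproducible):

```python
import numpy as np

def perm_corr_z(x, y, n_trials=1000, seed=0):
    """Z-score of the observed Pearson correlation against a permutation null."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    # Null distribution: correlation after shuffling x (breaks any real pairing)
    null = np.array([np.corrcoef(rng.permutation(x), y)[0, 1]
                     for _ in range(n_trials)])
    return (observed - null.mean()) / null.std()

# Strongly correlated toy data should give a large positive Z
x = np.arange(100, dtype=float)
z = perm_corr_z(x, 2 * x + 1)
print(z > 3)
```

Calling this once per feature pair would replace the ten near-identical loops that follow.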
#Calculate Pearson correlation
def pearson_r(x, y):
    corr_mat = np.corrcoef(x, y)
    return corr_mat[0, 1]
#Calculate Z-score of a value against the permutation distribution
def p_value(array, value):
    return (value - np.mean(array)) / np.std(array)
xi = temp
yi = ws
No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x = np.random.permutation(xi)
    y = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x, y)[0, 1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs Weekly Sales is", p_val[0,1])
xi = fp
yi = ws
No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x = np.random.permutation(xi)
    y = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x, y)[0, 1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Fuel_Price vs Weekly Sales is", p_val[0,1])
xi = cpi
yi = ws
No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x = np.random.permutation(xi)
    y = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x, y)[0, 1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for CPI vs Weekly Sales is", p_val[0,1])
xi = size
yi = ws
No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x = np.random.permutation(xi)
    y = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x, y)[0, 1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Weekly Sales is", p_val[0,1])
According to the following hypothesis tests, Store Size vs Temperature yields |Z| > 3, meaning the observed correlation lies more than 3 standard deviations from the mean of the permutation null; in other words, there is more than a 99.7% chance that the sample correlation did not arise by chance. We therefore have strong evidence to reject the null hypothesis and conclude that Store Size correlates with Temperature. On the other hand, Store Size does not correlate with the other two features (i.e. CPI and Fuel Price). This result also aligns with the correlation values in the heatmap above.
xi = size
yi = cpi
No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x = np.random.permutation(xi)
    y = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x, y)[0, 1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs CPI is", p_val[0,1])
xi = size
yi = fp
No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x = np.random.permutation(xi)
    y = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x, y)[0, 1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Fuel Price is", p_val[0,1])
xi = size
yi = temp
No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x = np.random.permutation(xi)
    y = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x, y)[0, 1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Temperature is", p_val[0,1])
According to the following hypothesis tests, Temperature vs Fuel Price and Temperature vs CPI both yield |Z| > 3, so the observed correlations lie more than 3 standard deviations from the permutation-null mean. We therefore have strong evidence to reject the null hypotheses and conclude that Temperature correlates with both Fuel Price and CPI. This result also aligns with the correlation values in the heatmap above.
xi = fp
yi = temp
No_bootstrap_trial = len(temp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x = np.random.permutation(xi)
    y = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x, y)[0, 1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs Fuel Price is", p_val[0,1])
xi = cpi
yi = temp
No_bootstrap_trial = len(temp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x = np.random.permutation(xi)
    y = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x, y)[0, 1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs CPI is", p_val[0,1])
According to the following hypothesis test, Fuel Price vs CPI likewise yields |Z| > 3, so we have strong evidence to reject the null hypothesis and conclude that Fuel Price correlates with CPI. This result also aligns with the correlation values in the heatmap above.
xi = cpi
yi = fp
No_bootstrap_trial = len(fp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x = np.random.permutation(xi)
    y = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x, y)[0, 1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Fuel Price vs CPI is", p_val[0,1])
The hypothesis tests above give us some hints about how the independent features interact with each other and how they correlate with the dependent variable. The feature most correlated with Weekly Sales is Store Size; the remaining independent features are comparatively less correlated. This gives us useful insight prior to the machine-learning part: based on this analysis, we can iterate models over more feature combinations in order to find the most efficient, best-performing model.
- NOT all predictor variables are informative for the analysis: MarkDown1-MarkDown5 are dropped from the data analysis and further predictive modeling due to their huge amount of missing values.
- Most instances are retained even when they contain outlier values, because those values may reflect seasonal patterns in the target variable (Weekly Sales), e.g. store sales are consistently higher in holiday seasons.
- Correlations: most features, except Fuel_Price, are correlated with the target variable, store Weekly Sales.
- Correlations: the predictor variables are mostly correlated with each other, except Store Size vs Fuel Price.
import pandas as pd
import numpy as np
trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")
df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)
#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']
display(train.head(3))
#Copy original dataset
train2 = train.copy()
train2 = train.drop(["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"], axis=1)
#Dummy out categorical features
train2=pd.concat((train2,pd.get_dummies(train2["IsHoliday"] , prefix="IsHoliday")),axis=1)
train2 = pd.concat((train2,pd.get_dummies(train2["Store_Type"] , prefix="Store_Type")),axis=1)
train2.drop(["IsHoliday","Store_Type"],axis=1, inplace = True)
train2.columns
train2.head(3)
# load additional module
import pickle
storeLst=[]
for store_no in sorted(list(set(train2.Store))):
    for date in sorted(list(set(train2.Date))):
        item = train2[train2.Store == store_no][train2.Date == date]
        storeLst.append(list([store_no,
                              date,
                              round(list(set(item.Temperature))[0], 2),
                              round(list(set(item.Fuel_Price))[0], 3),
                              round(list(set(item.CPI))[0], 5),
                              round(list(set(item.Unemployment))[0], 5),
                              list(set(item.Size))[0],
                              list(set(item.IsHoliday_False))[0],
                              list(set(item.IsHoliday_True))[0],
                              list(set(item.Store_Type_A))[0],
                              list(set(item.Store_Type_B))[0],
                              list(set(item.Store_Type_C))[0],
                              round(np.divide(item.Weekly_Sales.sum(), 1000000), 5)]))
with open('train2.data', 'wb') as filehandle:
    # store the data as a binary data stream
    pickle.dump(storeLst, filehandle)
import pandas as pd
import numpy as np
import pickle
with open('train2.data', 'rb') as filehandle:
    # read the data back from the binary stream
    storeLst = pickle.load(filehandle)
train3=pd.DataFrame(storeLst, columns=["store_no", "date","temp","fuel_price","cpi", "Unemploy","store_size",
"holiday_f","holiday_t", "type_A", "type_B", "type_C","wkly_sales" ])
train3.head(3)
new=train3.date.str.split("-",n=2,expand=True)
train3["year"] = new[0]
train3["month"] = new[1]
train3["day"] =new[2]
train3.head()
The dataset is split into Training and Testing sets in roughly a 64%/36% proportion. The cutoff date is set to "2011-11-18": data on or before the cutoff is used to train the models, while data after it is used to test the trained models.
train4=train3[train3.date <= "2011-11-18"]
test4=train3[train3.date > "2011-11-18"]
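The same cutoff-date logic can be sketched on toy data (hypothetical weekly dates; string comparison works here because `YYYY-MM-DD` strings sort chronologically):

```python
import pandas as pd

# Ten hypothetical weekly dates as YYYY-MM-DD strings
dates = pd.date_range("2011-10-07", periods=10, freq="7D").strftime("%Y-%m-%d")
df = pd.DataFrame({"date": dates})

cutoff = "2011-11-18"
train_part = df[df.date <= cutoff]   # on or before the cutoff
test_part = df[df.date > cutoff]     # strictly after the cutoff

# Every row lands in exactly one partition
print(len(train_part), len(test_part))
# → 7 3
```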
from sklearn import preprocessing
#data is pandas dataframe
def generate_x_y(data, timestep, normalize=True):
    """Slide a window of `timestep` weeks over each store's rows to build X and y."""
    min_max_scaler = preprocessing.MinMaxScaler()
ydata=[]
date_lst=[]
store_lst=[]
lst=[]
for store_no in sorted(list(set(data.store_no))):
# print(store_no)
ydata.append(data["wkly_sales"][data.store_no==store_no].iloc[timestep-1:])
date_lst.append(data["date"][data.store_no==store_no].iloc[timestep-1:])
store_lst.append(data["store_no"][data.store_no==store_no].iloc[timestep-1:])
arr =np.array(data[data.store_no==store_no])
reshape_length = timestep * arr.shape[1]
for i in range(arr.shape[0]-timestep+1):
s=arr[i:i+timestep]
lst.append(s.reshape(reshape_length))
df=pd.DataFrame(lst , columns=list(data.columns)*timestep)
xdata=df.drop(["store_no", "date", "wkly_sales"], axis=1)
ydata=np.array(ydata).flatten()
date_lst=np.array(date_lst).flatten()
store_lst=np.array(store_lst).flatten()
# Normalize X data
if normalize == True:
normalized_xdata=min_max_scaler.fit_transform(xdata)
return(normalized_xdata,np.array(ydata), date_lst, store_lst)
else:
return(np.array(xdata),np.array(ydata), date_lst, store_lst)
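The sliding-window reshaping inside `generate_x_y` can be sketched in isolation (a toy array with hypothetical values): each window of `timestep` consecutive rows is flattened into one training row, so a series of length `n` yields `n - timestep + 1` rows.

```python
import numpy as np

timestep = 3
arr = np.arange(10).reshape(5, 2)   # 5 weekly observations, 2 features each

windows = []
for i in range(arr.shape[0] - timestep + 1):
    # Flatten `timestep` consecutive rows into a single feature vector
    windows.append(arr[i:i + timestep].reshape(timestep * arr.shape[1]))
X = np.array(windows)

print(X.shape)
# → (3, 6): 5 - 3 + 1 windows, each with 3 * 2 features
```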
import matplotlib.pyplot as plt
%matplotlib inline
def scorePlot(window_slide_lst, train_score_lst, test_score_lst, modelName, ylabel):
df= pd.DataFrame(list(zip(window_slide_lst,
train_score_lst,
test_score_lst
)), columns=["timestep","train_score", "test_score"
])
df=df.set_index("timestep")
df.plot(use_index=True, figsize=(10,5))
plt.title(modelName)
plt.ylabel (ylabel)
plt.xlabel("Timestep / Window_Slide")
plt.xticks(window_slide_lst)
plt.show()
Before we go further into the machine learning part, let's establish how we will measure the performance of the models we are going to build. The primary metrics are $R^2$, Mean Squared Error (MSE), and Mean Absolute Error (MAE).
$R^2$, the coefficient of determination, shows the fraction of variance in the target variable that can be explained by the predictor variables. Regardless of model complexity, the score is computed from the differences between the predicted and true Y. The best possible score is 1; a model that always predicts the mean of the target scores 0, and worse models can score below 0. In our case, we use the `.score` API from sklearn for this task. The formula is: $$R^2 = 1-\frac{ \sum_{i=1}^N (yTrue_i - yPred_i)^2 }{ \sum_{i=1}^N (yTrue_i - \overline{yTrue})^2}$$
$MAE$ expresses the average absolute error of the target values predicted by the model: the higher the MAE, the larger the error on the predicted Y. The metric has a lower bound of zero, reached only when the predictions are error-free, and no upper bound. Because it weights all errors linearly, MAE gives a steady view of the typical error magnitude. The formula is: $$MAE = \frac{1}{N}\sum_{i=1}^N |yTrue_i - yPred_i| $$
$MSE$ expresses the average squared error of the target values predicted by the model; again, a higher MSE means larger errors on the predicted Y. It also has a lower bound of zero and no upper bound. Because the errors are squared, MSE reacts more strongly than MAE to large individual errors, which makes it useful for spotting occasional big misses in the predictions. The formula is: $$MSE = \frac{1}{N}\sum_{i=1}^N (yTrue_i - yPred_i)^2 $$
from sklearn.metrics import r2_score, median_absolute_error
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error
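A tiny sanity check of the three metrics on hypothetical predictions (toy numbers, just to illustrate the formulas above and how MAE and MSE weight errors differently):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])  # errors: -0.5, 0, 1.0, 0.5

# MAE averages |error|; MSE averages squared error, so it penalizes large misses more
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(round(mae, 3), round(mse, 3), round(r2, 3))
# → 0.5 0.375 0.925
```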
This Linear Regression model is our baseline, built without filtering any features out of the original data. We will later check whether other models offer higher accuracy. The following plot shows that the MSE/MAE metrics of the linear model trace a very stable path from timestep 2 to 6. The MSE values are generally smaller than the MAE values, which is expected here: the errors are in decimal form (weekly sales are expressed in millions), so squaring them makes the MSE values smaller. Also, there are no unusually large MSE values in the plot, so we can assume there is not much variance in the frequency distribution of the errors. Based on this result, a model with timestep = 2 would be optimal in terms of model complexity and error. We will therefore dig deeper into pruning parameters and features to optimize accuracy further.
from sklearn.linear_model import LinearRegression
train_reg_score_Lst=[]
test_reg_score_Lst=[]
test_reg_mse_Lst=[]
test_reg_mae_Lst=[]
window_slide_Lst =np.arange(2,7)
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
reg = LinearRegression().fit(xtrain, ytrain)
a=reg.score(xtrain, ytrain)
b=reg.score(xtest, ytest)
c=mean_squared_error(ytest,reg.predict(xtest) )
d=mean_absolute_error(ytest,reg.predict(xtest) )
train_reg_score_Lst.append(a)
test_reg_score_Lst.append(b)
test_reg_mse_Lst.append(c)
test_reg_mae_Lst.append(d)
#pd.DataFrame(list(zip(test_reg_mse_Lst,test_reg_mae_Lst)), columns=["MSE", "MAE"])
df= pd.DataFrame(list(zip(window_slide_Lst,
test_reg_mse_Lst,
test_reg_mae_Lst
)), columns=["timestep", "MSE", "MAE"])
df=df.set_index("timestep")
df.plot(use_index=True, figsize=(15,4))
plt.title("Linear Regression (Baseline Model)")
plt.ylabel ("MSE / MAE")
plt.xlabel("Timestep")
plt.xticks(np.arange(2,7))
plt.show()
scorePlot(window_slide_Lst, train_reg_score_Lst, test_reg_score_Lst, "Linear Regression", "Score")
import seaborn as sns
plt.figure(figsize=(10,10))
sns.heatmap(train4.iloc[:, 2:].corr(), linewidths=0.5, annot=True)
plt.show()
Based on the baseline model above, we are going to examine how the response variable (i.e. weekly sales) reacts to combined sets of predictor variables. The correlation table above shows that the "store_size", "type_A" and "type_C" variables have the strongest correlation with the target variable "wkly_sales". We therefore start the modelling with the full feature set and then narrow it down to these three variables.
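The feature-narrowing idea can be sketched generically (a toy frame with hypothetical column names): rank the predictors by the strength of their correlation with the target, and keep the strongest ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
size = rng.normal(size=n)
df = pd.DataFrame({
    "store_size": size,
    "unrelated": rng.normal(size=n),
    "sales": 3 * size + 0.1 * rng.normal(size=n),  # target driven mostly by store_size
})

# Correlation of every predictor with the target, as in the heatmap above
corr = df.corr()["sales"].drop("sales")
top = corr.abs().sort_values(ascending=False).index[0]
print(top)
# → store_size
```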
# Import regression modules
import statsmodels.api as sm
import statsmodels.formula.api as smf
# statsmodels works nicely with pandas dataframes
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4,
timestep=2, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4,
timestep=2, normalize=True)
m = sm.OLS(ytrain, xtrain).fit()
print(m.summary())
reg = LinearRegression().fit(xtrain, ytrain)
a=reg.score(xtrain, ytrain)
b=reg.score(xtest, ytest)
c=mean_squared_error(ytest,reg.predict(xtest) )
d=mean_absolute_error(ytest,reg.predict(xtest) )
print(a,b,c,d)
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=2, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=2, normalize=True)
m = sm.OLS(ytrain, xtrain).fit()
print(m.summary())
reg = LinearRegression().fit(xtrain, ytrain)
a=reg.score(xtrain, ytrain)
b=reg.score(xtest, ytest)
c=mean_squared_error(ytest,reg.predict(xtest) )
d=mean_absolute_error(ytest,reg.predict(xtest) )
print(a,b,c,d)
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C" ,"wkly_sales"]],
timestep=2, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
timestep=2, normalize=True)
m = sm.OLS(ytrain, xtrain).fit()
print(m.summary())
reg = LinearRegression().fit(xtrain, ytrain)
a=reg.score(xtrain, ytrain)
b=reg.score(xtest, ytest)
c=mean_squared_error(ytest,reg.predict(xtest) )
d=mean_absolute_error(ytest,reg.predict(xtest) )
print(a,b,c,d)
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C" ,"wkly_sales"]],
timestep=2, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=2, normalize=True)
m = sm.OLS(ytrain, xtrain).fit()
print(m.summary())
reg = LinearRegression().fit(xtrain, ytrain)
a=reg.score(xtrain, ytrain)
b=reg.score(xtest, ytest)
c=mean_squared_error(ytest,reg.predict(xtest) )
d=mean_absolute_error(ytest,reg.predict(xtest) )
print(a,b,c,d)
window_slide_Lst =np.arange(2,11)
from xgboost import XGBRegressor
train_xgb_score_Lst1=[]
test_xgb_score_Lst1=[]
test_xgb_mse_Lst1=[]
test_xgb_mae_Lst1=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4,
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4,
timestep=window_slide, normalize=True)
xgb = XGBRegressor()
xgb.fit(xtrain, ytrain)
a=xgb.score(xtrain, ytrain)
b=xgb.score(xtest, ytest)
c=mean_squared_error(ytest,xgb.predict(xtest) )
d=mean_absolute_error(ytest,xgb.predict(xtest) )
train_xgb_score_Lst1.append(a)
test_xgb_score_Lst1.append(b)
test_xgb_mse_Lst1.append(c)
test_xgb_mae_Lst1.append(d)
pd.DataFrame(list(zip(window_slide_Lst,test_xgb_mse_Lst1,test_xgb_mae_Lst1)), columns=["timestep","MSE", "MAE"])
scorePlot(window_slide_Lst, train_xgb_score_Lst1, test_xgb_score_Lst1, "XGB Regressor", "Score")
from xgboost import XGBRegressor
train_xgb_score_Lst2=[]
test_xgb_score_Lst2=[]
test_xgb_mse_Lst2=[]
test_xgb_mae_Lst2=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=window_slide, normalize=True)
xgb = XGBRegressor()
xgb.fit(xtrain, ytrain)
a=xgb.score(xtrain, ytrain)
b=xgb.score(xtest, ytest)
c=mean_squared_error(ytest,xgb.predict(xtest) )
d=mean_absolute_error(ytest,xgb.predict(xtest) )
train_xgb_score_Lst2.append(a)
test_xgb_score_Lst2.append(b)
test_xgb_mse_Lst2.append(c)
test_xgb_mae_Lst2.append(d)
#pd.DataFrame(list(zip(test_xgb_mse_Lst2,test_xgb_mae_Lst2)), columns=["MSE", "MAE"])
from xgboost import XGBRegressor
train_xgb_score_Lst3=[]
test_xgb_score_Lst3=[]
test_xgb_mse_Lst3=[]
test_xgb_mae_Lst3=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no","store_size", "type_A","type_B", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size", "type_A","type_B", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xgb = XGBRegressor()
xgb.fit(xtrain, ytrain)
a=xgb.score(xtrain, ytrain)
b=xgb.score(xtest, ytest)
c=mean_squared_error(ytest,xgb.predict(xtest) )
d=mean_absolute_error(ytest,xgb.predict(xtest) )
train_xgb_score_Lst3.append(a)
test_xgb_score_Lst3.append(b)
test_xgb_mse_Lst3.append(c)
test_xgb_mae_Lst3.append(d)
#pd.DataFrame(list(zip(test_xgb_mse_Lst3,test_xgb_mae_Lst3)), columns=["MSE", "MAE"])
from xgboost import XGBRegressor
train_xgb_score_Lst4=[]
test_xgb_score_Lst4=[]
test_xgb_mse_Lst4=[]
test_xgb_mae_Lst4=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no","store_size", "type_A", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size", "type_A", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xgb = XGBRegressor()
xgb.fit(xtrain, ytrain)
a=xgb.score(xtrain, ytrain)
b=xgb.score(xtest, ytest)
c=mean_squared_error(ytest,xgb.predict(xtest) )
d=mean_absolute_error(ytest,xgb.predict(xtest) )
train_xgb_score_Lst4.append(a)
test_xgb_score_Lst4.append(b)
test_xgb_mse_Lst4.append(c)
test_xgb_mae_Lst4.append(d)
#pd.DataFrame(list(zip(test_xgb_mse_Lst4,test_xgb_mae_Lst4)), columns=["MSE", "MAE"])
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
test_xgb_mse_Lst1,test_xgb_mae_Lst1,
test_xgb_mse_Lst2,test_xgb_mae_Lst2,
test_xgb_mse_Lst3,test_xgb_mae_Lst3,
test_xgb_mse_Lst4,test_xgb_mae_Lst4)), columns=["timestep",
"MSE-set1", "MAE-set1",
"MSE-set2", "MAE-set2",
"MSE-set3", "MAE-set3",
"MSE-set4", "MAE-set4"])
plotDF=plotDF.set_index("timestep")
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("XGB Error")
plt.ylabel("MSE / MAE")
plt.show()
scorePlot(window_slide_Lst, train_xgb_score_Lst1, test_xgb_score_Lst1, "XGB Regressor", "Score")
from sklearn.linear_model import RidgeCV
train_ridge_score_Lst1=[]
test_ridge_score_Lst1=[]
test_ridge_mse_Lst1=[]
test_ridge_mae_Lst1=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
ridge = RidgeCV(cv=5, alphas=(0.001,0.01,0.1, 1.0, 10.0))
ridge.fit(xtrain, ytrain)
a=ridge.score(xtrain, ytrain)
b=ridge.score(xtest, ytest)
c=mean_squared_error(ytest,ridge.predict(xtest) )
d=mean_absolute_error(ytest,ridge.predict(xtest) )
train_ridge_score_Lst1.append(a)
test_ridge_score_Lst1.append(b)
test_ridge_mse_Lst1.append(c)
test_ridge_mae_Lst1.append(d)
#pd.DataFrame(list(zip(test_ridge_mse_Lst1,test_ridge_mae_Lst1)), columns=["MSE", "MAE"])
train_ridge_score_Lst2=[]
test_ridge_score_Lst2=[]
test_ridge_mse_Lst2=[]
test_ridge_mae_Lst2=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=window_slide, normalize=True)
ridge = RidgeCV(cv=5, alphas=(0.001,0.01,0.1, 1.0, 10.0))
ridge.fit(xtrain, ytrain)
a=ridge.score(xtrain, ytrain)
b=ridge.score(xtest, ytest)
c=mean_squared_error(ytest,ridge.predict(xtest) )
d=mean_absolute_error(ytest,ridge.predict(xtest) )
train_ridge_score_Lst2.append(a)
test_ridge_score_Lst2.append(b)
test_ridge_mse_Lst2.append(c)
test_ridge_mae_Lst2.append(d)
train_ridge_score_Lst3=[]
test_ridge_score_Lst3=[]
test_ridge_mse_Lst3=[]
test_ridge_mae_Lst3=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
ridge = RidgeCV(cv=5, alphas=(0.001,0.01,0.1, 1.0, 10.0))
ridge.fit(xtrain, ytrain)
a=ridge.score(xtrain, ytrain)
b=ridge.score(xtest, ytest)
c=mean_squared_error(ytest,ridge.predict(xtest) )
d=mean_absolute_error(ytest,ridge.predict(xtest) )
train_ridge_score_Lst3.append(a)
test_ridge_score_Lst3.append(b)
test_ridge_mse_Lst3.append(c)
test_ridge_mae_Lst3.append(d)
train_ridge_score_Lst4=[]
test_ridge_score_Lst4=[]
test_ridge_mse_Lst4=[]
test_ridge_mae_Lst4=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
ridge = RidgeCV(cv=5, alphas=(0.001,0.01,0.1, 1.0, 10.0))
ridge.fit(xtrain, ytrain)
a=ridge.score(xtrain, ytrain)
b=ridge.score(xtest, ytest)
c=mean_squared_error(ytest,ridge.predict(xtest) )
d=mean_absolute_error(ytest,ridge.predict(xtest) )
train_ridge_score_Lst4.append(a)
test_ridge_score_Lst4.append(b)
test_ridge_mse_Lst4.append(c)
test_ridge_mae_Lst4.append(d)
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
test_ridge_mse_Lst1,test_ridge_mae_Lst1,
test_ridge_mse_Lst2,test_ridge_mae_Lst2,
test_ridge_mse_Lst3,test_ridge_mae_Lst3,
test_ridge_mse_Lst4,test_ridge_mae_Lst4)), columns=["timestep",
"MSE-set1", "MAE-set1",
"MSE-set2", "MAE-set2",
"MSE-set3", "MAE-set3",
"MSE-set4", "MAE-set4"])
plotDF=plotDF.set_index("timestep")
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("Ridge Error")
plt.ylabel("MSE / MAE")
plt.show()
scorePlot(window_slide_Lst, train_ridge_score_Lst1, test_ridge_score_Lst1, "Ridge Regression", "Score")
from sklearn.linear_model import LassoCV
train_lasso_score_Lst1=[]
test_lasso_score_Lst1=[]
test_lasso_mse_Lst1=[]
test_lasso_mae_Lst1=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
lasso = LassoCV(cv=5, alphas=(0.001,0.01, 0.1,1.0))
lasso.fit(xtrain, ytrain)
a=lasso.score(xtrain, ytrain)
b=lasso.score(xtest, ytest)
c=mean_squared_error(ytest,lasso.predict(xtest) )
d=mean_absolute_error(ytest,lasso.predict(xtest) )
train_lasso_score_Lst1.append(a)
test_lasso_score_Lst1.append(b)
test_lasso_mse_Lst1.append(c)
test_lasso_mae_Lst1.append(d)
#pd.DataFrame(list(zip(test_lasso_mse_Lst1,test_lasso_mae_Lst1)), columns=["MSE", "MAE"])
train_lasso_score_Lst2=[]
test_lasso_score_Lst2=[]
test_lasso_mse_Lst2=[]
test_lasso_mae_Lst2=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=window_slide, normalize=True)
lasso = LassoCV(cv=5, alphas=(0.001,0.01, 0.1,1.0))
lasso.fit(xtrain, ytrain)
a=lasso.score(xtrain, ytrain)
b=lasso.score(xtest, ytest)
c=mean_squared_error(ytest,lasso.predict(xtest) )
d=mean_absolute_error(ytest,lasso.predict(xtest) )
train_lasso_score_Lst2.append(a)
test_lasso_score_Lst2.append(b)
test_lasso_mse_Lst2.append(c)
test_lasso_mae_Lst2.append(d)
train_lasso_score_Lst3=[]
test_lasso_score_Lst3=[]
test_lasso_mse_Lst3=[]
test_lasso_mae_Lst3=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
lasso = LassoCV(cv=5, alphas=(0.001,0.01, 0.1,1.0))
lasso.fit(xtrain, ytrain)
a=lasso.score(xtrain, ytrain)
b=lasso.score(xtest, ytest)
c=mean_squared_error(ytest,lasso.predict(xtest) )
d=mean_absolute_error(ytest,lasso.predict(xtest) )
train_lasso_score_Lst3.append(a)
test_lasso_score_Lst3.append(b)
test_lasso_mse_Lst3.append(c)
test_lasso_mae_Lst3.append(d)
train_lasso_score_Lst4=[]
test_lasso_score_Lst4=[]
test_lasso_mse_Lst4=[]
test_lasso_mae_Lst4=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
lasso = LassoCV(cv=5, alphas=(0.001,0.01, 0.1,1.0))
lasso.fit(xtrain, ytrain)
a=lasso.score(xtrain, ytrain)
b=lasso.score(xtest, ytest)
c=mean_squared_error(ytest,lasso.predict(xtest) )
d=mean_absolute_error(ytest,lasso.predict(xtest) )
train_lasso_score_Lst4.append(a)
test_lasso_score_Lst4.append(b)
test_lasso_mse_Lst4.append(c)
test_lasso_mae_Lst4.append(d)
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
test_lasso_mse_Lst1,test_lasso_mae_Lst1,
test_lasso_mse_Lst2,test_lasso_mae_Lst2,
test_lasso_mse_Lst3,test_lasso_mae_Lst3,
test_lasso_mse_Lst4,test_lasso_mae_Lst4)), columns=["timestep",
"MSE-set1", "MAE-set1",
"MSE-set2", "MAE-set2",
"MSE-set3", "MAE-set3",
"MSE-set4", "MAE-set4"])
plotDF=plotDF.set_index("timestep")
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("Lasso Error")
plt.ylabel("MSE / MAE")
plt.show()
from sklearn import linear_model
train_sgd_score_Lst1=[]
test_sgd_score_Lst1=[]
test_sgd_mse_Lst1=[]
test_sgd_mae_Lst1=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4,
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4,
timestep=window_slide, normalize=True)
sgd = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001,max_iter=10000)
sgd.fit(xtrain,ytrain)
a=sgd.score(xtrain, ytrain)
b=sgd.score(xtest, ytest)
c=mean_squared_error(ytest,sgd.predict(xtest) )
d=mean_absolute_error(ytest,sgd.predict(xtest) )
train_sgd_score_Lst1.append(a)
test_sgd_score_Lst1.append(b)
test_sgd_mse_Lst1.append(c)
test_sgd_mae_Lst1.append(d)
scorePlot(window_slide_Lst, train_sgd_score_Lst1, test_sgd_score_Lst1, "SGD Regressor", "Score")
from sklearn import linear_model
train_sgd_score_Lst2=[]
test_sgd_score_Lst2=[]
test_sgd_mse_Lst2=[]
test_sgd_mae_Lst2=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=window_slide, normalize=True)
sgd = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001,max_iter=10000)
sgd.fit(xtrain,ytrain)
a=sgd.score(xtrain, ytrain)
b=sgd.score(xtest, ytest)
c=mean_squared_error(ytest,sgd.predict(xtest) )
d=mean_absolute_error(ytest,sgd.predict(xtest) )
train_sgd_score_Lst2.append(a)
test_sgd_score_Lst2.append(b)
test_sgd_mse_Lst2.append(c)
test_sgd_mae_Lst2.append(d)
from sklearn import linear_model
train_sgd_score_Lst3=[]
test_sgd_score_Lst3=[]
test_sgd_mse_Lst3=[]
test_sgd_mae_Lst3=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
sgd = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001,max_iter=10000)
sgd.fit(xtrain,ytrain)
a=sgd.score(xtrain, ytrain)
b=sgd.score(xtest, ytest)
c=mean_squared_error(ytest,sgd.predict(xtest) )
d=mean_absolute_error(ytest,sgd.predict(xtest) )
train_sgd_score_Lst3.append(a)
test_sgd_score_Lst3.append(b)
test_sgd_mse_Lst3.append(c)
test_sgd_mae_Lst3.append(d)
from sklearn import linear_model
train_sgd_score_Lst4=[]
test_sgd_score_Lst4=[]
test_sgd_mse_Lst4=[]
test_sgd_mae_Lst4=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
sgd = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001,max_iter=10000)
sgd.fit(xtrain,ytrain)
a=sgd.score(xtrain, ytrain)
b=sgd.score(xtest, ytest)
c=mean_squared_error(ytest,sgd.predict(xtest) )
d=mean_absolute_error(ytest,sgd.predict(xtest) )
train_sgd_score_Lst4.append(a)
test_sgd_score_Lst4.append(b)
test_sgd_mse_Lst4.append(c)
test_sgd_mae_Lst4.append(d)
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
test_sgd_mse_Lst1,test_sgd_mae_Lst1,
test_sgd_mse_Lst2,test_sgd_mae_Lst2,
test_sgd_mse_Lst3,test_sgd_mae_Lst3,
test_sgd_mse_Lst4,test_sgd_mae_Lst4)), columns=["timestep",
"MSE-set1", "MAE-set1",
"MSE-set2", "MAE-set2",
"MSE-set3", "MAE-set3",
"MSE-set4", "MAE-set4"])
plotDF=plotDF.set_index("timestep")
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("SGD Error")
plt.ylabel("MSE / MAE")
plt.show()
from sklearn import linear_model
train_bay_score_Lst1=[]
test_bay_score_Lst1=[]
test_bay_mse_Lst1=[]
test_bay_mae_Lst1=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_Lst1, train_store_Lst1 = generate_x_y(train4,
timestep=window_slide, normalize=True)
xtest,ytest, test_date_Lst1, test_store_Lst1 = generate_x_y(test4,
timestep=window_slide, normalize=True)
bay = linear_model.BayesianRidge()
bay.fit(xtrain, ytrain)
a=bay.score(xtrain, ytrain)
b=bay.score(xtest, ytest)
c=mean_squared_error(ytest,bay.predict(xtest) )
d=mean_absolute_error(ytest,bay.predict(xtest) )
train_bay_score_Lst1.append(a)
test_bay_score_Lst1.append(b)
test_bay_mse_Lst1.append(c)
test_bay_mae_Lst1.append(d)
#pd.DataFrame(list(zip(test_bay_mse_Lst1,test_bay_mae_Lst1)), columns=["MSE", "MAE"])
from sklearn import linear_model
train_bay_score_Lst2=[]
test_bay_score_Lst2=[]
test_bay_mse_Lst2=[]
test_bay_mae_Lst2=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_Lst2, train_store_Lst2 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_Lst2, test_store_Lst2 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
timestep=window_slide, normalize=True)
bay = linear_model.BayesianRidge()
bay.fit(xtrain, ytrain)
a=bay.score(xtrain, ytrain)
b=bay.score(xtest, ytest)
c=mean_squared_error(ytest,bay.predict(xtest) )
d=mean_absolute_error(ytest,bay.predict(xtest) )
train_bay_score_Lst2.append(a)
test_bay_score_Lst2.append(b)
test_bay_mse_Lst2.append(c)
test_bay_mae_Lst2.append(d)
#pd.DataFrame(list(zip(test_bay_mse_Lst2,test_bay_mae_Lst2)), columns=["MSE", "MAE"])
from sklearn import linear_model
train_bay_score_Lst3=[]
test_bay_score_Lst3=[]
test_bay_mse_Lst3=[]
test_bay_mae_Lst3=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_Lst3, train_store_Lst3 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_Lst3, test_store_Lst3 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
bay = linear_model.BayesianRidge()
bay.fit(xtrain, ytrain)
a=bay.score(xtrain, ytrain)
b=bay.score(xtest, ytest)
c=mean_squared_error(ytest,bay.predict(xtest) )
d=mean_absolute_error(ytest,bay.predict(xtest) )
train_bay_score_Lst3.append(a)
test_bay_score_Lst3.append(b)
test_bay_mse_Lst3.append(c)
test_bay_mae_Lst3.append(d)
#pd.DataFrame(list(zip(test_bay_mse_Lst3,test_bay_mae_Lst3)), columns=["MSE", "MAE"])
from sklearn import linear_model
train_bay_score_Lst4=[]
test_bay_score_Lst4=[]
test_bay_mse_Lst4=[]
test_bay_mae_Lst4=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=window_slide, normalize=True)
bay = linear_model.BayesianRidge()
bay.fit(xtrain, ytrain)
a=bay.score(xtrain, ytrain)
b=bay.score(xtest, ytest)
c=mean_squared_error(ytest,bay.predict(xtest) )
d=mean_absolute_error(ytest,bay.predict(xtest) )
train_bay_score_Lst4.append(a)
test_bay_score_Lst4.append(b)
test_bay_mse_Lst4.append(c)
test_bay_mae_Lst4.append(d)
#pd.DataFrame(list(zip(test_bay_mse_Lst4,test_bay_mae_Lst4)), columns=["MSE", "MAE"])
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
test_bay_mse_Lst1,test_bay_mae_Lst1,
test_bay_mse_Lst2,test_bay_mae_Lst2,
test_bay_mse_Lst3,test_bay_mae_Lst3,
test_bay_mse_Lst4,test_bay_mae_Lst4)), columns=["timestep",
"MSE-set1", "MAE-set1",
"MSE-set2", "MAE-set2",
"MSE-set3", "MAE-set3",
"MSE-set4", "MAE-set4"])
plotDF=plotDF.set_index("timestep")
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("Bayesian Error")
plt.ylabel("MSE / MAE")
plt.show()
scorePlot(window_slide_Lst, train_bay_score_Lst1, test_bay_score_Lst1, "Bayesian Regressor", "Score")
from sklearn.ensemble import RandomForestRegressor
train_rfr_score_Lst1=[]
test_rfr_score_Lst1=[]
test_rfr_mse_Lst1=[]
test_rfr_mae_Lst1=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
xtrain,ytrain, train_date_Lst1, train_store_Lst1 = generate_x_y(train4, timestep=window_slide, normalize=True)
xtest,ytest, test_date_Lst1, test_store_Lst1 = generate_x_y(test4, timestep=window_slide, normalize=True)
rfr= RandomForestRegressor(n_estimators=100)
rfr.fit(xtrain, ytrain)
a=rfr.score(xtrain, ytrain)
b=rfr.score(xtest, ytest)
c=mean_squared_error(ytest,rfr.predict(xtest) )
d=mean_absolute_error(ytest,rfr.predict(xtest) )
train_rfr_score_Lst1.append(a)
test_rfr_score_Lst1.append(b)
test_rfr_mse_Lst1.append(c)
test_rfr_mae_Lst1.append(d)
#pd.DataFrame(list(zip(test_rfr_mse_Lst1,test_rfr_mae_Lst1)), columns=["MSE", "MAE"])
from sklearn.ensemble import RandomForestRegressor
train_rfr_score_Lst2=[]
test_rfr_score_Lst2=[]
test_rfr_mse_Lst2=[]
test_rfr_mae_Lst2=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst2, train_store_Lst2 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
        timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst2, test_store_Lst2 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
        timestep=window_slide, normalize=True)
    rfr= RandomForestRegressor(n_estimators=100)
    rfr.fit(xtrain, ytrain)
    a=rfr.score(xtrain, ytrain)
    b=rfr.score(xtest, ytest)
    c=mean_squared_error(ytest,rfr.predict(xtest) )
    d=mean_absolute_error(ytest,rfr.predict(xtest) )
    train_rfr_score_Lst2.append(a)
    test_rfr_score_Lst2.append(b)
    test_rfr_mse_Lst2.append(c)
    test_rfr_mae_Lst2.append(d)
#pd.DataFrame(list(zip(test_rfr_mse_Lst2,test_rfr_mae_Lst2)), columns=["MSE", "MAE"])
from sklearn.ensemble import RandomForestRegressor
train_rfr_score_Lst3=[]
test_rfr_score_Lst3=[]
test_rfr_mse_Lst3=[]
test_rfr_mae_Lst3=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst3, train_store_Lst3 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst3, test_store_Lst3 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    rfr= RandomForestRegressor(n_estimators=100)
    rfr.fit(xtrain, ytrain)
    a=rfr.score(xtrain, ytrain)
    b=rfr.score(xtest, ytest)
    c=mean_squared_error(ytest,rfr.predict(xtest) )
    d=mean_absolute_error(ytest,rfr.predict(xtest) )
    train_rfr_score_Lst3.append(a)
    test_rfr_score_Lst3.append(b)
    test_rfr_mse_Lst3.append(c)
    test_rfr_mae_Lst3.append(d)
#pd.DataFrame(list(zip(test_rfr_mse_Lst3,test_rfr_mae_Lst3)), columns=["MSE", "MAE"])
from sklearn.ensemble import RandomForestRegressor
train_rfr_score_Lst4=[]
test_rfr_score_Lst4=[]
test_rfr_mse_Lst4=[]
test_rfr_mae_Lst4=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    rfr= RandomForestRegressor(n_estimators=100)
    rfr.fit(xtrain, ytrain)
    a=rfr.score(xtrain, ytrain)
    b=rfr.score(xtest, ytest)
    c=mean_squared_error(ytest,rfr.predict(xtest) )
    d=mean_absolute_error(ytest,rfr.predict(xtest) )
    train_rfr_score_Lst4.append(a)
    test_rfr_score_Lst4.append(b)
    test_rfr_mse_Lst4.append(c)
    test_rfr_mae_Lst4.append(d)
#pd.DataFrame(list(zip(test_rfr_mse_Lst4,test_rfr_mae_Lst4)), columns=["MSE", "MAE"])
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
test_rfr_mse_Lst1,test_rfr_mae_Lst1,
test_rfr_mse_Lst2,test_rfr_mae_Lst2,
test_rfr_mse_Lst3,test_rfr_mae_Lst3,
test_rfr_mse_Lst4,test_rfr_mae_Lst4)), columns=["timestep",
"MSE-set1", "MAE-set1",
"MSE-set2", "MAE-set2",
"MSE-set3", "MAE-set3",
"MSE-set4", "MAE-set4"])
plotDF=plotDF.set_index("timestep")
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("Random Forest Regressor Error")
plt.ylabel("MSE / MAE")
plt.show()
scorePlot(window_slide_Lst, train_rfr_score_Lst, test_rfr_score_Lst, "Random Forest Regressor")
from sklearn.ensemble import GradientBoostingRegressor
train_gbr_score_Lst1=[]
test_gbr_score_Lst1=[]
test_gbr_mse_Lst1=[]
test_gbr_mae_Lst1=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst1, train_store_Lst1 = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst1, test_store_Lst1 = generate_x_y(test4, timestep=window_slide, normalize=True)
    gbr= GradientBoostingRegressor(n_estimators=200)
    gbr.fit(xtrain, ytrain)
    a=gbr.score(xtrain, ytrain)
    b=gbr.score(xtest, ytest)
    c=mean_squared_error(ytest,gbr.predict(xtest) )
    d=mean_absolute_error(ytest,gbr.predict(xtest) )
    train_gbr_score_Lst1.append(a)
    test_gbr_score_Lst1.append(b)
    test_gbr_mse_Lst1.append(c)
    test_gbr_mae_Lst1.append(d)
#pd.DataFrame(list(zip(test_gbr_mse_Lst1,test_gbr_mae_Lst1)), columns=["MSE", "MAE"])
from sklearn.ensemble import GradientBoostingRegressor
train_gbr_score_Lst2=[]
test_gbr_score_Lst2=[]
test_gbr_mse_Lst2=[]
test_gbr_mae_Lst2=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst2, train_store_Lst2 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
        timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst2, test_store_Lst2 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
        timestep=window_slide, normalize=True)
    gbr= GradientBoostingRegressor(n_estimators=200)
    gbr.fit(xtrain, ytrain)
    a=gbr.score(xtrain, ytrain)
    b=gbr.score(xtest, ytest)
    c=mean_squared_error(ytest,gbr.predict(xtest) )
    d=mean_absolute_error(ytest,gbr.predict(xtest) )
    train_gbr_score_Lst2.append(a)
    test_gbr_score_Lst2.append(b)
    test_gbr_mse_Lst2.append(c)
    test_gbr_mae_Lst2.append(d)
#pd.DataFrame(list(zip(test_gbr_mse_Lst2,test_gbr_mae_Lst2)), columns=["MSE", "MAE"])
from sklearn.ensemble import GradientBoostingRegressor
train_gbr_score_Lst3=[]
test_gbr_score_Lst3=[]
test_gbr_mse_Lst3=[]
test_gbr_mae_Lst3=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst3, train_store_Lst3 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst3, test_store_Lst3 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    gbr= GradientBoostingRegressor(n_estimators=200)
    gbr.fit(xtrain, ytrain)
    a=gbr.score(xtrain, ytrain)
    b=gbr.score(xtest, ytest)
    c=mean_squared_error(ytest,gbr.predict(xtest) )
    d=mean_absolute_error(ytest,gbr.predict(xtest) )
    train_gbr_score_Lst3.append(a)
    test_gbr_score_Lst3.append(b)
    test_gbr_mse_Lst3.append(c)
    test_gbr_mae_Lst3.append(d)
#pd.DataFrame(list(zip(test_gbr_mse_Lst3,test_gbr_mae_Lst3)), columns=["MSE", "MAE"])
from sklearn.ensemble import GradientBoostingRegressor
train_gbr_score_Lst4=[]
test_gbr_score_Lst4=[]
test_gbr_mse_Lst4=[]
test_gbr_mae_Lst4=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    gbr= GradientBoostingRegressor(n_estimators=200)
    gbr.fit(xtrain, ytrain)
    a=gbr.score(xtrain, ytrain)
    b=gbr.score(xtest, ytest)
    c=mean_squared_error(ytest,gbr.predict(xtest) )
    d=mean_absolute_error(ytest,gbr.predict(xtest) )
    ypred=gbr.predict(xtest)
    train_gbr_score_Lst4.append(a)
    test_gbr_score_Lst4.append(b)
    test_gbr_mse_Lst4.append(c)
    test_gbr_mae_Lst4.append(d)
#pd.DataFrame(list(zip(test_gbr_mse_Lst4,test_gbr_mae_Lst4)), columns=["MSE", "MAE"])
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
test_gbr_mse_Lst1,test_gbr_mae_Lst1,
test_gbr_mse_Lst2,test_gbr_mae_Lst2,
test_gbr_mse_Lst3,test_gbr_mae_Lst3,
test_gbr_mse_Lst4,test_gbr_mae_Lst4)), columns=["timestep",
"MSE-set1", "MAE-set1",
"MSE-set2", "MAE-set2",
"MSE-set3", "MAE-set3",
"MSE-set4", "MAE-set4"])
plotDF=plotDF.set_index("timestep")
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("Gradient Boost Regressor Error")
plt.ylabel("MSE / MAE")
plt.show()
scorePlot(window_slide_Lst, train_gbr_score_Lst, test_gbr_score_Lst, "Gradient Boosting Regressor")
from sklearn.ensemble import AdaBoostRegressor
train_abr_score_Lst1=[]
test_abr_score_Lst1=[]
test_abr_mse_Lst1=[]
test_abr_mae_Lst1=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst1, train_store_Lst1 = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst1, test_store_Lst1 = generate_x_y(test4, timestep=window_slide, normalize=True)
    abr=AdaBoostRegressor(n_estimators=400, learning_rate=0.001)
    abr.fit(xtrain, ytrain)
    a=abr.score(xtrain, ytrain)
    b=abr.score(xtest, ytest)
    c=mean_squared_error(ytest,abr.predict(xtest) )
    d=mean_absolute_error(ytest,abr.predict(xtest) )
    train_abr_score_Lst1.append(a)
    test_abr_score_Lst1.append(b)
    test_abr_mse_Lst1.append(c)
    test_abr_mae_Lst1.append(d)
#pd.DataFrame(list(zip(test_abr_mse_Lst1,test_abr_mae_Lst1)), columns=["MSE", "MAE"])
from sklearn.ensemble import AdaBoostRegressor
train_abr_score_Lst2=[]
test_abr_score_Lst2=[]
test_abr_mse_Lst2=[]
test_abr_mae_Lst2=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst2, train_store_Lst2 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
        timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst2, test_store_Lst2 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C", "Unemploy","wkly_sales"]],
        timestep=window_slide, normalize=True)
    abr=AdaBoostRegressor(n_estimators=400, learning_rate=0.001)
    abr.fit(xtrain, ytrain)
    a=abr.score(xtrain, ytrain)
    b=abr.score(xtest, ytest)
    c=mean_squared_error(ytest,abr.predict(xtest) )
    d=mean_absolute_error(ytest,abr.predict(xtest) )
    train_abr_score_Lst2.append(a)
    test_abr_score_Lst2.append(b)
    test_abr_mse_Lst2.append(c)
    test_abr_mae_Lst2.append(d)
#pd.DataFrame(list(zip(test_abr_mse_Lst2,test_abr_mae_Lst2)), columns=["MSE", "MAE"])
from sklearn.ensemble import AdaBoostRegressor
train_abr_score_Lst3=[]
test_abr_score_Lst3=[]
test_abr_mse_Lst3=[]
test_abr_mae_Lst3=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst3, train_store_Lst3 = generate_x_y(train4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst3, test_store_Lst3 = generate_x_y(test4[["date", "store_no", "store_size","type_A","type_B", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    abr=AdaBoostRegressor(n_estimators=400, learning_rate=0.001)
    abr.fit(xtrain, ytrain)
    a=abr.score(xtrain, ytrain)
    b=abr.score(xtest, ytest)
    c=mean_squared_error(ytest,abr.predict(xtest) )
    d=mean_absolute_error(ytest,abr.predict(xtest) )
    train_abr_score_Lst3.append(a)
    test_abr_score_Lst3.append(b)
    test_abr_mse_Lst3.append(c)
    test_abr_mae_Lst3.append(d)
#pd.DataFrame(list(zip(test_abr_mse_Lst3,test_abr_mae_Lst3)), columns=["MSE", "MAE"])
from sklearn.ensemble import AdaBoostRegressor
train_abr_score_Lst4=[]
test_abr_score_Lst4=[]
test_abr_mse_Lst4=[]
test_abr_mae_Lst4=[]
#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
        timestep=window_slide, normalize=True)
    abr=AdaBoostRegressor(n_estimators=400, learning_rate=0.001)
    abr.fit(xtrain, ytrain)
    a=abr.score(xtrain, ytrain)
    b=abr.score(xtest, ytest)
    c=mean_squared_error(ytest,abr.predict(xtest) )
    d=mean_absolute_error(ytest,abr.predict(xtest) )
    train_abr_score_Lst4.append(a)
    test_abr_score_Lst4.append(b)
    test_abr_mse_Lst4.append(c)
    test_abr_mae_Lst4.append(d)
#pd.DataFrame(list(zip(test_abr_mse_Lst4,test_abr_mae_Lst4)), columns=["MSE", "MAE"])
plotDF=pd.DataFrame(list(zip(window_slide_Lst,
test_abr_mse_Lst1,test_abr_mae_Lst1,
test_abr_mse_Lst2,test_abr_mae_Lst2,
test_abr_mse_Lst3,test_abr_mae_Lst3,
test_abr_mse_Lst4,test_abr_mae_Lst4)), columns=["timestep",
"MSE-set1", "MAE-set1",
"MSE-set2", "MAE-set2",
"MSE-set3", "MAE-set3",
"MSE-set4", "MAE-set4"])
plotDF=plotDF.set_index("timestep")
plotDF.plot(use_index=True, figsize=(13,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("AdaBoost Regressor Error")
plt.ylabel("MSE / MAE")
plt.show()
scorePlot(window_slide_Lst, train_abr_score_Lst, test_abr_score_Lst, "AdaBoost Regressor")
Based on the following MSE and MAE error comparison plots, two models clearly stand out: XGB and the Gradient Boosting Regressor outperform the other models. A window slide of six consistently appears to be the best timestep for ALL models on this dataset. Timestep = 6 is also reasonable for the domain of this project: the model uses the prior five weeks of information to forecast the sixth week's sales. Test-set performance improves over the baseline model when using feature set 4 (i.e. "store_size", "type_A", "type_C"): MSE is reduced from 0.108 to 0.0128, while MAE is reduced from 0.238 to 0.08. The tuned model therefore generates roughly four times less error than the baseline model. This result shows that the primary goal of this project is achieved: finding the best model to predict weekly sales for Walmart stores.
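To make the timestep idea concrete, here is a minimal, hypothetical sketch of the sliding-window framing that `generate_x_y` (defined earlier in this notebook) is assumed to perform: with `timestep=6`, each sample uses the prior five weekly observations as features and the sixth week's sales as the target. The helper name `make_windows` and the single-series simplification are illustrative only.

```python
import numpy as np

def make_windows(series, timestep=6):
    """Frame a 1-D series of weekly sales as sliding-window samples.

    Each sample's features are the (timestep - 1) prior weeks and the
    target is the sales value of the following week.
    """
    x, y = [], []
    for i in range(len(series) - timestep + 1):
        x.append(series[i:i + timestep - 1])   # prior 5 weeks as features
        y.append(series[i + timestep - 1])     # 6th week to forecast
    return np.array(x), np.array(y)

sales = np.arange(10.0)            # stand-in for one store's weekly sales
x, y = make_windows(sales, timestep=6)
print(x.shape, y.shape)            # (5, 5) (5,)
```

Increasing the timestep gives each sample more history but yields fewer training samples, which is the trade-off swept over in the loops above.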
df= pd.DataFrame(list(zip(window_slide_Lst,
test_xgb_mae_Lst4,
test_ridge_mae_Lst4,
test_lasso_mae_Lst4,
test_sgd_mae_Lst4,
test_bay_mae_Lst4,
test_rfr_mae_Lst4,
test_gbr_mae_Lst4,
test_abr_mae_Lst4)), columns=["timestep",
"xgb", "LR-Ridge", "LR-Lasso",
"SGD Regressor",
"Bayesian Regressor",
"Random Forest Regressor",
"Gradient Boosting Regressor",
"AdaBoost Regressor"
])
df=df.set_index("timestep")
df.plot(use_index=True, figsize=(15,5))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("MAE Comparison")
plt.ylabel ("MAE")
plt.xlabel("Timestep")
plt.xticks(np.arange(2,11))
plt.show()
df= pd.DataFrame(list(zip(window_slide_Lst,
test_xgb_mse_Lst4,
test_ridge_mse_Lst4,
test_lasso_mse_Lst4,
test_sgd_mse_Lst4,
test_bay_mse_Lst4,
test_rfr_mse_Lst4,
test_gbr_mse_Lst4,
test_abr_mse_Lst4)), columns=["timestep",
"xgb", "LR-Ridge", "LR-Lasso",
"SGD Regressor",
"Bayesian Regressor",
"Random Forest Regressor",
"Gradient Boosting Regressor",
"AdaBoost Regressor"
])
df=df.set_index("timestep")
df.plot(use_index=True, figsize=(15,6))
plt.axvline(x=6, color="red", linestyle="--")
plt.title("MSE Comparison")
plt.ylabel ("MSE")
plt.xlabel("Timestep")
plt.xticks(np.arange(2,11))
plt.show()
This part identifies the hyperparameters that give each model its highest accuracy. We grid-search the three best models from the analysis above: XGB, Gradient Boosting Regressor and Random Forest Regressor. We also fix the timestep at 6, which has offered the best performance so far.
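One detail worth noting before reading the grid-search output: with `scoring="neg_mean_squared_error"`, `GridSearchCV` maximizes the *negated* MSE, so `best_score_` is reported as a non-positive number and the actual cross-validated MSE is its sign-flipped value. A small self-contained demo on toy data (not the Walmart set; `Ridge` and the `alpha` grid are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Toy regression data, purely for demonstration
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(100)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

cv_mse = -search.best_score_       # flip the sign to recover the MSE
print(search.best_score_ <= 0)     # True: scorer reports negated MSE
print(cv_mse >= 0)                 # True: recovered MSE is non-negative
```

The same sign convention applies to the `best_score_` values printed by the grid searches below.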
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=6, normalize=True)
xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=6, normalize=True)
gbr= GradientBoostingRegressor(random_state = 0)
grid_params = {
'n_estimators': [100, 200,300,500,800,1000],
"learning_rate": [0.001,0.01,0.1,1]
}
model = GridSearchCV(estimator=gbr,
param_grid=grid_params,
cv = 5,
#scoring = make_scorer(mean_squared_error, greater_is_better=False)
scoring = "neg_mean_squared_error")
model.fit(xtrain, ytrain)
print("Best Parameters of Gradient Boosting Regressor:",model.best_params_)
print("Best Score:",model.best_score_)
gbr= GradientBoostingRegressor(random_state = 0,
n_estimators = 100,
learning_rate=0.01
)
gbr.fit(xtrain, ytrain)
gbr_mse=mean_squared_error(ytest, gbr.predict(xtest))
print("Train Set MSE",mean_squared_error(ytrain, gbr.predict(xtrain)))
print("Test Set MSE",gbr_mse)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=6, normalize=True)
xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=6, normalize=True)
xgb= XGBRegressor(random_state = 0)
grid_params = {
'gamma':[i/10.0 for i in range(3,8)],
'subsample':[i/10.0 for i in range(5,10)],
}
model = GridSearchCV(estimator=xgb,
param_grid=grid_params,
cv = 5,
#scoring = make_scorer(mean_squared_error, greater_is_better=False)
scoring = "neg_mean_squared_error")
model.fit(xtrain, ytrain)
print("Best Parameters of XGB Regressor:",model.best_params_)
print("Best Score:",model.best_score_)
xgb= XGBRegressor(random_state = 0,
gamma= 0.7,
subsample= 0.9
)
xgb.fit(xtrain, ytrain)
xgb_mse=mean_squared_error(ytest, xgb.predict(xtest))
print("Train Set MSE",mean_squared_error(ytrain, xgb.predict(xtrain)))
print("Test Set MSE",xgb_mse)
from sklearn.ensemble import RandomForestRegressor
xtrain,ytrain, train_date_Lst4, train_store_Lst4 = generate_x_y(train4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=6, normalize=True)
xtest,ytest, test_date_Lst4, test_store_Lst4 = generate_x_y(test4[["date", "store_no", "store_size","type_A", "type_C","wkly_sales"]],
timestep=6, normalize=True)
rfr= RandomForestRegressor(random_state = 0)
grid_params = {
'max_depth': [10, 30, 50, 70, 90, 100, None],
"max_features" : ["auto", "sqrt"],
'n_estimators': [200, 400, 600, 800, 1000]}
model = GridSearchCV(estimator=rfr,
param_grid=grid_params,
cv = 5,
#scoring = make_scorer(mean_squared_error, greater_is_better=False)
scoring = "neg_mean_squared_error")
model.fit(xtrain, ytrain)
print("Best Parameters of RFR Regressor:",model.best_params_)
print("Best Score:",model.best_score_)
rfr= RandomForestRegressor(random_state = 0,
max_depth=10,
n_estimators=1000
)
rfr.fit(xtrain, ytrain)
rfr_mse = mean_squared_error(ytest, rfr.predict(xtest))
print("Train Set MSE",mean_squared_error(ytrain, rfr.predict(xtrain)))
print("Test Set MSE",rfr_mse)
The chart below illustrates that the three best models have very similar MSE values. The Random Forest model produces the lowest error, about 0.0129; Gradient Boosting comes second and XGB third. The hyperparameters that produced each model's optimized result are reported in the grid-search output above.
import matplotlib.pyplot as plt
%matplotlib inline
labels=["XGB", "Gradient Boosting", "Random Forest"]
fig=plt.figure()
ax1= fig.add_subplot(1,1,1)
ax1.bar(x=labels,
        height=[xgb_mse, gbr_mse, rfr_mse],
        color=["red", "yellow", "blue"],
        width=0.5)
ax1.set_title("Regressor Error Comparison", fontsize=18)
ax1.set_ylabel("Mean_Square_Error", fontsize=15)
ax1.set_xlabel("Models", fontsize=15)
ax1.set_xticklabels(labels, ha="center")
plt.show()